Note: This page's design, presentation and content have been created and enhanced using Claude (Anthropic's AI assistant) to improve visual quality and educational experience.
Week 7 • Sub-Lesson 4

Verification of AI-Generated Code

The most important skill in AI-assisted coding is not writing code — it is knowing whether the code you received actually does what you think it does.

What We'll Cover

This is the most critical lesson of the week. The previous sub-lessons showed you how AI can help with data analysis, visualisation, and translating research questions into code. This lesson is about the part that matters most: making sure the code is doing the right analysis, not just an analysis.

A word of honest context first. Modern agentic tools like Claude Code have largely solved the problem of code that crashes. They run the code, read the error, fix it, and iterate — often without any human intervention. If your script fails, the tool handles it. That is genuinely impressive, and it means one entire category of verification work has been automated away.

What remains — and what this lesson is about — is the deeper category of errors that no amount of autonomous iteration can catch: code that runs perfectly but uses the wrong statistical test for your data type, the wrong variable because two columns have similar names, the wrong aggregation logic because the AI did not know your unit of analysis. These errors produce polished, professional-looking results. They will not trigger an error message. They will not cause the tool to retry. They will just give you the wrong answer, beautifully formatted. That is the failure mode this lesson addresses.

This session will teach you how to verify the scientific correctness of AI-generated code — even if you are not an experienced programmer. You do not need to write the code from scratch. You do need to understand it well enough to judge whether it is answering your actual research question.

🚨 Why Verification Matters More Than Generation

There is an asymmetry at the heart of AI-assisted coding that every researcher needs to understand: generating code is easy, but verifying that it answers the right scientific question is hard. AI has dramatically lowered the barrier to producing code that runs — modern agentic tools have largely automated the process of fixing crashes and syntax errors. What has not been automated is verifying that the code implements the right statistical logic for your specific research question. That judgment requires you.

⚠️ The trust trap: With modern agentic tools handling syntax errors autonomously, the trust trap has shifted. The danger now is not code that crashes — the tool will fix that. The danger is code that runs on the first try, produces a polished table or plot, and gives you a p-value. That is when most people stop checking. But "ran without errors" and "computed the right answer to my research question" are very different things. Always verify the scientific logic of the output, not just whether it executed.

📊 The Verification Asymmetry

Consider a practical example: you ask an AI to compute the mean and standard deviation of a dataset grouped by experimental condition. The code runs, prints a neatly formatted table, and the numbers look plausible. But did the AI:

  • Use the correct column for grouping, or a similarly-named one?
  • Handle missing values correctly, or silently drop rows?
  • Compute the sample standard deviation (dividing by n-1) or the population standard deviation (dividing by n)?
  • Include or exclude outliers that should have been filtered?
  • Apply the grouping before or after a filtering step?

Every one of these questions can change your results. None of them will cause an error message. The code will run either way — it will just give you the wrong answer. This is why verification is not a nice-to-have. It is the core skill.
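The standard-deviation question above is a good miniature of the problem: both versions are one line of code, both run silently, and they give different numbers. A minimal sketch using only the standard library (note that NumPy's np.std defaults to the population version, while most statistics packages default to the sample version):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Sample standard deviation (divides by n - 1): usually what research wants.
sample_sd = statistics.stdev(data)

# Population standard deviation (divides by n): numpy's np.std default.
population_sd = statistics.pstdev(data)

print(f"sample SD:     {sample_sd:.4f}")
print(f"population SD: {population_sd:.4f}")
# Both lines "ran without errors" -- only you know which one your
# methods section actually describes.
```

With eight data points the two values differ noticeably; with large n they converge, which makes the error harder to spot by eye.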

📖 Reading Code You Didn't Write

The single most effective verification technique is also the simplest: can you explain, in plain language, what each part of the code does? This is sometimes called the "explain it back" test, and it is your first and most important line of defence against errors in AI-generated code.

💡 The "Explain It Back" Test: Before you run any AI-generated code on your real data, go through it line by line (or block by block) and explain in your own words what each section does. If you cannot explain it, you do not understand it well enough to trust it. Ask the AI to explain the parts you do not understand — but then verify those explanations too.

This does not mean you need to understand every syntax detail. You do not need to memorise function signatures or know every argument by heart. But you do need to understand the logic of what the code is doing at each step. Here is a practical approach:

  1. Read the code from top to bottom and annotate it. Add comments in your own words describing what each block does. "This section loads the data." "This filters out participants under 18." "This computes the correlation between variables X and Y." If you cannot write these annotations, you have found a gap in your understanding.
  2. Identify the key decision points. Where does the code make choices that affect the results? Filtering conditions, grouping variables, statistical tests, handling of missing values — these are the places where errors are most likely and most consequential. Pay extra attention here.
  3. Ask the AI to explain its choices. "Why did you use a t-test instead of a Mann-Whitney U test?" "Why are you dropping rows with missing values instead of imputing them?" "What does the axis=0 argument do here?" The quality of the AI's explanation will often reveal whether the choice was principled or arbitrary.
  4. Check the explanation against your research design. Does the code implement what your methodology section describes? If your methods say "we used a two-tailed test" but the code uses a one-tailed test, that is a discrepancy that needs to be resolved before you proceed.
  5. Trace the data flow. Follow a single data point from the raw input through every transformation to the final output. Does it go where you expect? Does it get modified in ways you intended? This is tedious but extremely effective at catching errors in data wrangling pipelines.
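Steps 1 and 5 can be sketched concretely. The snippet below uses a hypothetical toy dataset and hypothetical column names (age, score); the point is the plain-language annotations and the trace of a single data point through the pipeline:

```python
# Toy dataset with hypothetical fields, small enough to trace by hand.
rows = [
    {"id": 1, "age": 25, "score": 80.0},
    {"id": 2, "age": 17, "score": 90.0},   # under 18 -> should be filtered out
    {"id": 3, "age": 30, "score": 70.0},
]

# "This filters out participants under 18."
adults = [r for r in rows if r["age"] >= 18]

# "This computes the mean score of the remaining participants."
mean_score = sum(r["score"] for r in adults) / len(adults)

# Trace one data point: participant 2 is removed by the filter, so the mean
# is (80 + 70) / 2 = 75, not (80 + 90 + 70) / 3.
print(len(adults), mean_score)
```

If your own annotation of a block disagrees with what the trace shows the code doing, you have found either a bug or a gap in your understanding — both are worth resolving before touching real data.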
💡 A note on humility: There is no shame in not understanding a piece of code. The shame is in running code you do not understand and publishing results based on it. If you are using AI to generate code in a language or framework you are not familiar with, budget extra time for the verification step. The less you understand the language, the more carefully you need to check.

🔧 Practical Verification Techniques

Beyond reading the code, you need a toolkit of practical techniques for confirming that AI-generated code produces correct results. None of these techniques requires advanced programming skills — they require careful thinking and a willingness to test systematically.

🎯 Test with Known Data

Create a small dataset where you know what the correct answer should be, and run the code on that data first. If you are computing a mean, create five numbers and calculate the mean by hand. If you are running a regression, use a dataset where you know the relationship. If the code gives the wrong answer on data you understand, it will give the wrong answer on your real data too.

This is the single most powerful verification technique available to non-programmers. You do not need to read the code perfectly — you just need to know what the right answer looks like.
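A sketch of the pattern, with summarise standing in for an AI-generated helper you want to vet:

```python
# Hand-checkable dataset: five numbers whose summary you can verify on paper
# before trusting the same code path on real data.
known = [10, 20, 30, 40, 50]

def summarise(values):
    """Stand-in for an AI-generated summary function under scrutiny."""
    n = len(values)
    return {"n": n, "mean": sum(values) / n,
            "min": min(values), "max": max(values)}

result = summarise(known)

# By hand: (10 + 20 + 30 + 40 + 50) / 5 = 150 / 5 = 30.
assert result == {"n": 5, "mean": 30.0, "min": 10, "max": 50}
print("summary matches the hand-computed answer:", result)
```

Once the function passes on data you fully understand, run it on the real dataset with far more confidence.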

⚠️ Test Edge Cases

Edge cases are the boundary conditions where code is most likely to break. What happens when a group has only one observation? What happens with missing values? What about negative numbers, zeros, or extremely large values? What if a categorical variable has a category with no data?

AI-generated code often handles the "happy path" well but fails on edge cases. Testing these explicitly is how you find the bugs that will otherwise surface at the worst possible moment — after you have submitted your paper.
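A sketch of deliberate edge-case probing, using a small hand-rolled grouping function as the code under test:

```python
import statistics

def group_sd(groups):
    """Sample SD per group; None where SD is undefined (fewer than 2 points)."""
    return {name: statistics.stdev(values) if len(values) >= 2 else None
            for name, values in groups.items()}

# Happy path plus the edge cases worth probing explicitly.
result = group_sd({
    "control":   [1.0, 2.0, 3.0],  # normal group
    "singleton": [5.0],            # one observation: SD is undefined
    "extreme":   [0.0, 1e12],      # huge values: check nothing misbehaves
})
print(result)
```

Naive code would crash (or worse, return a misleading 0) on the singleton group; making the behaviour explicit forces you to decide what the right answer is before the edge case shows up in real data.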

🧠 Sanity Checks

After every analysis step, ask: "Does this result make sense?" If your dataset has 500 participants but the summary shows 487 rows, where did the other 13 go? If the mean age is 3.7, something has gone wrong. If a correlation is exactly 1.0 on real data, something is suspicious.

Print intermediate results. Check the shape of your data at each step. Verify that counts, totals, and ranges match what you expect. These simple checks catch a remarkable number of errors.
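The row-count check can be made mechanical rather than mental. A sketch with a hypothetical age column, failing loudly when rows vanish unexpectedly:

```python
# Sanity-check pattern: count rows before and after each step, and assert
# plausibility bounds on the result. Field names here are hypothetical.
rows = [{"id": i, "age": a} for i, a in enumerate([25, None, 34, 41, None])]

n_before = len(rows)
complete = [r for r in rows if r["age"] is not None]
n_after = len(complete)

dropped = n_before - n_after
print(f"{n_before} rows in, {n_after} rows out, {dropped} dropped")

# If more rows disappeared than expected, stop and investigate.
assert dropped == 2, "unexpected number of dropped rows"

mean_age = sum(r["age"] for r in complete) / n_after
assert 18 <= mean_age <= 100, f"implausible mean age: {mean_age}"
```

The plausibility bounds (here, ages between 18 and 100) come from your knowledge of the study, not from the code — which is exactly why the AI cannot write them for you.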

📈 Compare with Established Tools

If you are running a statistical test, try running the same test in a different tool and comparing results. Compute a t-test in both Python and R, or compare your regression output with what Excel or SPSS produces. If the results match, you have good evidence that the code is correct. If they differ, you have found a problem worth investigating.

This cross-validation approach is especially valuable when you are using a tool or library you are not familiar with.

🔄 The Change-One-Thing Test

Modify one input value and check whether the output changes in the direction you expect. If you increase a participant's score, does the group mean increase? If you add an outlier, does the standard deviation get larger? If you change the grouping variable, do you get different groups?

This technique tests your understanding of the code's behaviour without requiring you to read every line. If the output does not respond to changes the way you expect, the code is not doing what you think it is doing.
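The directional checks in the paragraph above can be written as executable assertions. A sketch, with analyse standing in for an AI-generated analysis function:

```python
import statistics

def analyse(scores):
    """Stand-in for an AI-generated analysis returning mean and sample SD."""
    return statistics.mean(scores), statistics.stdev(scores)

baseline = [70.0, 75.0, 80.0, 85.0]
mean0, sd0 = analyse(baseline)

# Change one thing: raise one participant's score. The mean must go up.
bumped = [70.0, 75.0, 80.0, 95.0]
mean1, _ = analyse(bumped)
assert mean1 > mean0, "mean did not respond to an increased score"

# Change one thing: add an outlier. The SD must get larger.
with_outlier = baseline + [150.0]
_, sd2 = analyse(with_outlier)
assert sd2 > sd0, "SD did not respond to an added outlier"

print("outputs moved in the expected directions")
```

Note that these assertions only check direction, not magnitude — that is the point. You can state the expected direction with confidence even when you could not compute the exact value by hand.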

🛠️ Write Simple Unit Tests

A unit test checks that a single function does what it claims. You can ask the AI to write unit tests for the code it generated, and then review those tests to make sure they are testing the right things. Better yet, write the test yourself: "Given this input, I expect this output." Then run it.

Even a handful of well-chosen tests dramatically increases your confidence in the code. If you are not sure what to test, focus on the functions that compute your key results — the ones that will appear in your paper.
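A minimal sketch of the "given this input, I expect this output" style, using a hypothetical effect-size helper as the function under test (the expected values are computed by hand in the comments):

```python
def cohen_d(group_a, group_b):
    """Hypothetical key-result function: Cohen's d with pooled sample SD."""
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (ma - mb) / pooled_sd

# Test 1: identical groups must give exactly zero effect.
assert cohen_d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0

# Test 2: means 2 and 4, both variances 1, pooled SD 1 -> d = (2 - 4) / 1 = -2.
assert abs(cohen_d([1.0, 2.0, 3.0], [3.0, 4.0, 5.0]) - (-2.0)) < 1e-9

print("all unit tests passed")
```

For larger projects you would move such checks into a proper test framework (pytest or unittest), but even inline asserts like these document your expectations and catch regressions when the code is later modified.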

👉 The golden rule of verification: If you cannot think of a way to check whether the code's output is correct, you are not ready to use that output in your research. Every result that goes into your paper should have been verified by at least one independent method. This is not paranoia — it is scientific rigour.

🐛 Common AI Code Failure Patterns

AI models tend to make specific, predictable types of errors in code. Knowing these patterns helps you focus your verification effort on the places where mistakes are most likely. Think of this as a checklist of the usual suspects — the errors you should actively look for every time you receive AI-generated code.

  • Variable Confusion. What happens: The AI uses a column name that does not exist in your data, or confuses two similarly-named variables (e.g., score vs score_adjusted). The code may still run if the wrong variable exists, silently computing results on the wrong data. How to catch it: Print column names before and after each transformation. Check that the variables referenced in the code actually exist in your dataset and refer to what you think they do.
  • Off-by-One Errors. What happens: Loops or slicing operations include one too many or one too few elements. In Python, range(1, 10) gives you 1 through 9, not 1 through 10. These errors affect counts, indices, and aggregations. How to catch it: Test with a small dataset where you can count the elements manually. Check that loop boundaries and slice ranges include exactly the elements you expect.
  • Wrong Statistical Test. What happens: The AI applies a parametric test when your data is not normally distributed, uses an independent-samples test when your data is paired, or applies a test that assumes equal variances when variances differ. The code runs without errors — it just gives you a meaningless p-value. How to catch it: Before running any statistical test, verify that the assumptions of the test are met. Check the AI's choice of test against your research design and the properties of your data.
  • Missing Data Handling. What happens: The AI silently drops rows with missing values (changing your sample size), fills them with zeros (biasing your results), or ignores them in a way that produces misleading statistics. Different functions handle NaN differently, and the AI may not be consistent. How to catch it: Check your row count before and after every operation. Explicitly ask: "How does this code handle missing values?" Compare the number of observations in your output with what you expect.
  • Aggregation Errors. What happens: The AI aggregates at the wrong level — computing a mean of means instead of an overall mean, double-counting observations, or grouping by the wrong variable. These errors are especially common in multi-level or longitudinal data. How to catch it: Verify the grouping structure at each step. Check that the number of groups matches what you expect. Compute a simple aggregation by hand and compare it with the code's output.
  • Library Version Issues. What happens: The AI generates code that uses function signatures, argument names, or default behaviours from a different version of a library than what you have installed. A function that existed in pandas 1.x may have been renamed or deprecated in pandas 2.x. The AI's training data includes code from many versions. How to catch it: Check your installed library versions. If you get unexpected errors or deprecation warnings, look up the function's documentation for your specific version. Pin your library versions in a requirements file.
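The missing-data pattern is easy to demonstrate in miniature. Three plausible treatments of the same column give three different "means", and none of them raises an error:

```python
import math

# One column with a missing value, represented as NaN.
ages = [25.0, float("nan"), 34.0, 41.0]

# Treatment 1: NaN propagates -- the mean is silently NaN.
mean_propagated = sum(ages) / len(ages)

# Treatment 2: drop missing values -- sample size silently shrinks from 4 to 3.
clean = [a for a in ages if not math.isnan(a)]
mean_dropped = sum(clean) / len(clean)

# Treatment 3: fill with zero -- biases the mean downward.
filled = [0.0 if math.isnan(a) else a for a in ages]
mean_zero_filled = sum(filled) / len(filled)

print(mean_propagated, mean_dropped, mean_zero_filled)
# Three different answers, zero error messages: always ask explicitly how
# missing values were handled, and check row counts at every step.
```

Libraries differ too: for example, pandas aggregation functions typically skip NaN by default, while plain Python sums propagate it, so mixing the two in one pipeline is a classic source of inconsistency.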

🔍 Example: A Subtle Aggregation Error

Suppose you have data from 30 participants across 3 experimental conditions, with 10 observations per participant per condition. You ask AI to compute the mean reaction time per condition. A common mistake is to compute the mean of all 300 observations in each condition directly, rather than first computing each participant's mean and then averaging those participant-level means.

If participants contributed different numbers of valid trials (due to dropped outliers or missing responses), the direct approach gives more weight to participants with more trials. The participant-level approach gives equal weight to each participant. Both are valid choices, but they answer different questions and can give different results. The AI will pick one without asking you which you intended — and the code will run without complaint either way.

The fix: Always specify the aggregation structure you want. Check intermediate outputs to confirm the grouping is correct. When in doubt, compute both ways and see if the results differ — if they do, you need to make a principled decision about which is appropriate for your research question.
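A scaled-down version of the scenario makes the divergence concrete. Here one hypothetical participant contributes four valid trials and another only one, so the two aggregation strategies disagree sharply:

```python
from collections import defaultdict

# (participant, reaction_time_ms) pairs for a single condition; trial counts
# are deliberately unequal, as after outlier removal.
trials = [
    ("p1", 300.0), ("p1", 310.0), ("p1", 290.0), ("p1", 300.0),  # 4 trials
    ("p2", 500.0),                                                # 1 trial
]

# Strategy A: pool all trials, then average (weights p1 four times as heavily).
pooled_mean = sum(rt for _, rt in trials) / len(trials)

# Strategy B: per-participant means first, then average those (equal weights).
by_participant = defaultdict(list)
for pid, rt in trials:
    by_participant[pid].append(rt)
participant_means = [sum(v) / len(v) for v in by_participant.values()]
mean_of_means = sum(participant_means) / len(participant_means)

print(f"pooled mean:         {pooled_mean:.1f} ms")
print(f"mean of part. means: {mean_of_means:.1f} ms")
# Both run without complaint; they answer different questions.
```

With balanced trial counts the two strategies coincide, which is precisely why this bug can hide during development and only surface once real, messy data arrives.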

📂 Version Control and Reproducibility

Verification is not a one-time event. Code evolves as your analysis develops, and you need a system for tracking what changed, when, and why. Version control is not just a software engineering practice — it is a scientific practice that protects the reproducibility of your work.

⚙️ A Minimal Reproducibility Checklist

Before you consider your analysis complete, confirm that you have:

  1. A clean, well-commented analysis script that runs from start to finish without manual intervention. Someone reading it should be able to follow the logic without referring to your paper.
  2. A record of your computing environment — Python/R version, library versions, operating system. Use pip freeze > requirements.txt or sessionInfo() to generate this automatically.
  3. Documentation of your data — what each variable means, how missing values are coded, what units are used, and any preprocessing steps applied before the analysis script runs.
  4. A verification log — what you tested, the expected vs actual results, and how you resolved any discrepancies. This can be as simple as a text file or a section of comments in your code.
  5. An AI usage record — which parts of the code were AI-generated, what prompts you used, and what modifications you made. This supports both transparency and reproducibility.
💡 Why this matters for your career: Reproducibility failures are increasingly career-threatening. High-profile retractions often trace back to code errors that went undetected because no one checked. Building good version control and documentation habits now protects you from these risks and signals to collaborators and reviewers that you take computational rigour seriously.

📚 Readings and Resources

📑 Kapoor, S. & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science.

https://doi.org/10.1016/j.patter.2023.100804

A landmark paper documenting how data leakage — a subtle and common code error — has led to widespread irreproducible results across scientific fields that use machine learning. The authors found the problem in hundreds of published papers across 17 scientific fields. Essential reading for understanding why code verification matters at a systemic level.

📑 Cheng, L., Li, X., & Bing, L. (2023). Is GPT-4 a Good Data Analyst?

https://arxiv.org/abs/2305.15038

An empirical evaluation of GPT-4's capability for data analysis tasks, examining where it succeeds and where it fails. The paper provides concrete examples of the kinds of errors AI makes when generating analysis code, making it directly relevant to the verification techniques covered in this lesson. Note that the models evaluated are now dated, so treat the specific capability findings as a snapshot rather than the current state of the art.

📖 Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. — R for Data Science (2nd edition)

https://r4ds.hadley.nz/

The standard reference for data analysis in R, freely available online. Even if you primarily use Python, the principles of tidy data, reproducible workflows, and systematic data transformation are universal. The chapters on data import, transformation, and communication are especially relevant to verification practices.

📖 Good Research Code Handbook

https://goodresearch.dev/

A practical, opinionated guide to writing research code that is correct, reproducible, and maintainable. Covers project organisation, version control, testing, and documentation with a focus on academic researchers rather than software engineers. Highly recommended for anyone who wants to build good habits around code quality in research.

Key Takeaways

  • Verification is more important than generation. Anyone can generate code with AI. The skill that distinguishes rigorous research from sloppy research is the ability to verify that the code actually does what it claims. Budget at least as much time for verification as you do for generation.
  • The "explain it back" test is your first line of defence. If you cannot explain in plain language what each section of the code does, you do not understand it well enough to trust it. Read the code, annotate it, and trace the data flow before running it on real data.
  • Use practical verification techniques systematically. Test with known data, check edge cases, perform sanity checks, compare with established tools, try the change-one-thing test, and write unit tests. No single technique catches everything — use several in combination.
  • Know the common failure patterns. Variable confusion, off-by-one errors, wrong statistical tests, missing data mishandling, aggregation errors, and library version issues — these are the errors AI makes most often. Look for them actively every time.
  • Version control and reproducibility are not optional. Track your code changes, pin your dependencies, document your verification steps, and record your AI usage. Your future self, your collaborators, and your reviewers will thank you.
  • You are responsible for your results. The AI is a tool. You are the researcher. Every number that appears in your paper is your responsibility, regardless of who or what generated the code that produced it.
👉 Up next: In Sub-Lesson 5, we bring everything together with hands-on activities and the weekly assessment. You will apply the verification techniques from this lesson to real AI-generated analysis code, practise catching the failure patterns we covered, and build a verified, reproducible analysis workflow from start to finish.